Revisiting k-means: New Algorithms via Bayesian Nonparametrics

Authors

  • Brian Kulis
  • Michael I. Jordan
Abstract

Bayesian models offer great flexibility for clustering applications: Bayesian nonparametrics can be used for modeling infinite mixtures, and hierarchical Bayesian models can be utilized for sharing clusters across multiple data sets. For the most part, such flexibility is lacking in classical clustering methods such as k-means. In this paper, we revisit the k-means clustering algorithm from a Bayesian nonparametric viewpoint. Inspired by the asymptotic connection between k-means and mixtures of Gaussians, we show that a Gibbs sampling algorithm for the Dirichlet process mixture approaches a hard clustering algorithm in the limit, and further that the resulting algorithm monotonically minimizes an elegant underlying k-means-like clustering objective that includes a penalty for the number of clusters. We generalize this analysis to the case of clustering multiple data sets through a similar asymptotic argument with the hierarchical Dirichlet process. We also discuss further extensions that highlight the benefits of our analysis: i) a spectral relaxation involving thresholded eigenvectors, and ii) a normalized cut graph clustering algorithm that does not fix the number of clusters in the graph.
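To make the idea of a hard clustering algorithm with a penalty on the number of clusters concrete, the following is a minimal sketch rather than the authors' reference implementation. It assumes the penalized objective is the sum of squared Euclidean distances to the cluster means plus a constant `lam` per cluster, and that a point opens a new cluster whenever its squared distance to every existing mean exceeds `lam`. The names `dp_means`, `penalized_objective`, `lam`, and `max_iter` are illustrative choices, not taken from the abstract.

```python
import numpy as np


def dp_means(X, lam, max_iter=100):
    """Hard clustering with a per-cluster penalty `lam` (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    centers = [X.mean(axis=0)]        # assumption: start with one cluster at the global mean
    labels = np.zeros(n, dtype=int)
    for _ in range(max_iter):
        changed = False
        for i in range(n):
            # Squared distance from point i to every current center.
            d2 = np.array([np.sum((X[i] - c) ** 2) for c in centers])
            j = int(np.argmin(d2))
            if d2[j] > lam:
                # Too far from all centers: open a new cluster at this point.
                centers.append(X[i].copy())
                j = len(centers) - 1
            if labels[i] != j:
                labels[i] = j
                changed = True
        # Drop empty clusters, relabel to 0..k-1, and recompute the means.
        used = np.unique(labels)
        labels = np.searchsorted(used, labels)
        centers = [X[labels == k].mean(axis=0) for k in range(len(used))]
        if not changed:
            break
    return np.array(centers), labels


def penalized_objective(X, centers, labels, lam):
    """Sum of squared distances to assigned means plus `lam` per cluster."""
    X = np.asarray(X, dtype=float)
    return float(np.sum((X - centers[labels]) ** 2) + lam * len(centers))
```

In this sketch `lam` plays the role that the fixed number of clusters plays in ordinary k-means: a larger value makes new clusters more expensive and yields fewer of them, while a smaller value yields more.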


Related resources

Multiagent Planning with Bayesian Nonparametric Asymptotics

Autonomous multiagent systems are beginning to see use in complex, changing environments that cannot be completely specified a priori. In order to be adaptive to these environments and avoid the fragility associated with making too many a priori assumptions, autonomous systems must incorporate some form of learning. However, learning techniques themselves often require structural assumptions to...


Exploiting Big Data in Logistics Risk Assessment via Bayesian Nonparametrics

By Yan Shang, Department of Statistical Science, Duke University.


Non-parametric Power-law Data Clustering

Automatically determining the number of clusters from the distribution of a dataset has always been a great challenge for clustering algorithms. Several approaches have been proposed to address this issue, including recent promising work that incorporates Bayesian nonparametrics into the k-means clustering procedure. This approach shows simplicity in implementation and solidity in t...


708, Spring 2014, 19: Bayesian Nonparametrics: Dirichlet Processes

In parametric modeling, it is assumed that data can be represented by models using a fixed, finite number of parameters. Examples of parametric models include clusters of K Gaussians and polynomial regression models. In many problems, determining the number of parameters a priori is difficult; for example, selecting the number of clusters in a cluster model, the number of segments in an image s...


Distributional properties of means of random probability measures

The present paper provides a review of the results concerning distributional properties of means of random probability measures. Our interest in this topic has originated from inferential problems in Bayesian Nonparametrics. Nonetheless, it is worth noting that these random quantities play an important role in seemingly unrelated areas of research. In fact, there is a wealth of contributions bo...



Journal:
  • CoRR

Volume: abs/1111.0352

Publication date: 2012